Leveraging Newswire Treebanks for Parsing Conversational Data with Argument Scrambling
Abstract
We investigate the problem of parsing conversational data of morphologically-rich languages such as Hindi where argument scrambling occurs frequently. We evaluate a state-of-the-art non-linear transition-based parsing system on a new dataset containing 506 dependency trees for sentences from Bollywood (Hindi) movie scripts and Twitter posts of Hindi monolingual speakers. We show that a dependency parser trained on a newswire treebank is strongly biased towards the canonical structures and degrades when applied to conversational data. Inspired by Transformational Generative Grammar (Chomsky, 1965), we mitigate the sampling bias by generating all theoretically possible alternative word orders of a clause from the existing (kernel) structures in the treebank. Training our parser on canonical and transformed structures improves performance on conversational data by around 9% LAS over the baseline newswire parser.
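The word-order generation step described in the abstract can be illustrated with a small sketch. This is not the authors' procedure: it assumes, for illustration only, that a clause is given as a head verb plus a list of dependent chunks, that each chunk stays contiguous, and that every ordering of these units counts as a candidate. The function name `scrambled_orders` and the romanized toy clause are hypothetical.

```python
from itertools import permutations


def scrambled_orders(head, dependents):
    """Yield every ordering of the clause head and its dependent chunks.

    Assumption: `head` is a single token (the main verb) and each entry in
    `dependents` is a sequence of tokens that must stay contiguous.
    """
    units = [(head,)] + [tuple(chunk) for chunk in dependents]
    for order in permutations(units):
        # Flatten the reordered units back into a token sequence.
        yield [token for unit in order for token in unit]


# Hypothetical kernel clause in canonical SOV order: "Ram ne kitab padhi"
kernel_head = "padhi"
kernel_deps = [("Ram", "ne"), ("kitab",)]

orders = [" ".join(tokens) for tokens in scrambled_orders(kernel_head, kernel_deps)]
for sentence in orders:
    print(sentence)
```

With one head and two chunks this produces 3! = 6 orderings, including the canonical one. A real treebank transformation would also need to carry the dependency arcs and labels along with the tokens and to filter out orders the grammar does not allow, which this sketch leaves out.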
Similar resources
Web-scale Surface and Syntactic n-gram Features for Dependency Parsing
We develop novel first- and second-order features for dependency parsing based on the Google Syntactic Ngrams corpus, a collection of subtree counts of parsed sentences from scanned books. We also extend previous work on surface n-gram features from Web1T to the Google Books corpus and from first-order to second-order, comparing and analysing performance over newswire and web treebanks. Surface a...
Heterogeneous Parsing via Collaborative Decoding
There often exist multiple corpora for the same natural language processing (NLP) tasks. However, such corpora are generally used independently due to distinctions in annotation standards. For the purpose of full use of readily available human annotations, it is significant to simultaneously utilize multiple corpora of different annotation standards. In this paper, we focus on the challenge of ...
Language Independent Dependency to Constituent Tree Conversion
We present a dependency to constituent tree conversion technique that aims to improve constituent parsing accuracies by leveraging dependency treebanks available in a wide variety in many languages. The technique works in two steps. First, a partial constituent tree is derived from a dependency tree with a very simple deterministic algorithm that is both language and dependency type independent...
A Parallel Chart Parser for German, Its Argument Interpretation Strategy, and the Treatment of Infinitives
In this paper we present a GB-parsing system for German and in particular its strategy for argument interpretation, which copes with the difficulty that word order is relatively free in German (due to scrambling) and also that the arguments can precede their predicate. In this case, the parser makes a provisional interpretation, which is checked when the argument structure of the predi...
Chinese Statistical Parsing
This chapter describes several issues that are fundamental to achieving accurate Chinese parsing given available Chinese resources and the challenges of the Gale processing pipeline. For Gale, our parsing algorithm is expected to accurately parse various different materials, ranging from newswire text, which tends to be grammatically well formed, to n-best ASR outputs, many of which are poorly ...
Publication date: 2017